Pierre-Luc Germain - 25 August, 2017
library(smallRNA)
Assuming we have already removed adapter sequences, we first collapse the fastq files:
fastqFiles <- list.files(pattern="fastq\.gz$")
collapseFastq(fastqFiles)
We then generate the sequence counts matrix:
seqcountsFiles <- list.files(pattern="seqcounts$")
collapsed2countMatrix(seqcountsFiles, "seqcounts.matrix")
We extract the sequences and align them to the custom genome:
seqcounts2fasta("seqcounts.matrix", "seqs.fasta")
smallRNAexp_align("seqs.fasta", "aligned.bam", nthreads=4)
The alignment procedure is based on a customized genome that includes, on top of the normal (in this case mm10) genome, additional sequences of ‘modified RNAs’, such as isomirs from canonical miRNAs and post-transcriptionally modified tRNAs. This therefore combines the advantages of both a genome-based alignment with an alignment on known transcripts. One advantage of this method is that post-transcriptional modifications, instead of being a roadblock to alignment and hence sequence identification, is actually used to distinguish sequences. One consequence is that many reads will align to multiple locations, but this ‘synonymy’ is handled through feature nesting (see below), which is anyway needed given the frequent multiple alignment of small RNAs.
This alignment wrapper function works in two steps, and on an in. First, it attempts to align end-to-end reads (adapter trimming should have been performed beforehand) without mismatches using bowtie1, accepting multiple similar-quality hits (–best –strata -m 1000). Reads that fail to align are then aligned with STAR to allow for soft clipping. Alignment is currently not stranded, which might be preferable given the reported incompleteness/inconsistencies in strand-specificity of Illumina small RNA protocols, but we could eventually try to make use of this information in later versions.
Once this double alignment procedure is over, we parse the alignements and extract all potential source genomic features for each sequences (requires a prepared elements bedfile):
smallRNAexp_parseBam("aligned.bam", "elements.bed", outputfile="seqs.srcs")
Any interesction (of at least 50% of the read) with any genomic feature is reported, and will be considered for sequence assignment (below).
Before creating the object, we’ll need to construct a data.frame of sample annotation (i.e. phenoData). The row names need to correspond to column names of the seqcounts matrix.
pd <- data.frame(row.names=colnames(read.delim("seqcounts.matrix",header=T,row.names=1,nrows=2)), group=rep(c("CTRL","MSUS"),c(4,3)))
pd
## group
## CTRL1_GSM1214556 CTRL
## CTRL2_GSM1214557 CTRL
## CTRL3_GSM1214551 CTRL
## CTRL4_GSM1214552 CTRL
## MSUS1_GSM1214555 MSUS
## MSUS2_GSM1214553 MSUS
## MSUS3_GSM1214554 MSUS
Now we can build the smallRNAexp object:
o <- smallRNAexp("seqcounts.matrix", "seqs.srcs", pd, features2meta="~/bioinfo/mansuy/anno/ensembl2mirbase")
We’ll also perform the default normalization across samples:
o <- calcNormFactors(o)
calcNormFactors is named this way because it relies on edgeR’s calcNormFactors function, which means that it implements all the same normalization method (default: TMM). In addition, it also implements the GC-aware conditional quantile normalization (more on normalization later).
When sequences are ambiguous, i.e. potentially originating from more than one loci and/or genomic feature, sequence assignment is done using feature nesting. In addition, assignment rules are used which can be customized via the rules argument of the smallRNAexp function. To illustrate this, let us consider two examples.
The sequence ‘AAAAGCTGGGTTGAGAGGGCG’ aligns to the Mir320 locus on chr14, as well as to the mmu-miR-320-3p mature miRNA, with complete overlap. Since both feature actually represent the same miRNA (the correspondence being established through biomart’s ID conversion), and the read has the length required for miRNA (defined through the customizable rules), the read is certain to be assigned to this feature. Since the read overlaps both features equally (100% of the read within the feature), the feature is assigned to the most specific of the two, i.e. mmu-miR-320-3p. Had the overlap been imperfect with that feature (e.g. only 19 of the reads’ 21 nucleotides overlapping the feature), it would instead been assigned to the more general feature (converted to mirBase ID, i.e. mmu-miR-320).
The sources slot of the object shows the read assignment:
o@sources["AAAAGCTGGGTTGAGAGGGCG",]
## location cigar src_name src_type overlap
## AAAAGCTGGGTTGAGAGGGCG ambiguous 21M mmu-mir-320-3p miRNA 21
## length status
## AAAAGCTGGGTTGAGAGGGCG 21 unique
The read ‘AACATTCATTGCTGTCGGTGGG’ aligns to multiple locations:
getPotentialSources(o,"AACATTCATTGCTGTCGGTGGG", srcsFile="seqs.srcs")
## sequence cigar location pos_in_feature
## 1 AACATTCATTGCTGTCGGTGGG 22M chr1:137966650 11
## 2 AACATTCATTGCTGTCGGTGGG 22M chr1:137966650 11
## 3 AACATTCATTGCTGTCGGTGGG 22M chr1:137966650 119
## 4 AACATTCATTGCTGTCGGTGGG 22M chr1:137966650 119
## 5 AACATTCATTGCTGTCGGTGGG 22M chr1:137966650 2449
## 6 AACATTCATTGCTGTCGGTGGG 22M chr1:137966650 38261
## 7 AACATTCATTGCTGTCGGTGGG 22M chr1:137966650 68052
## 8 AACATTCATTGCTGTCGGTGGG 22M chr2:38853845 113302
## 9 AACATTCATTGCTGTCGGTGGG 22M chr2:38853845 130471
## 10 AACATTCATTGCTGTCGGTGGG 22M chr2:38853845 130475
## 11 AACATTCATTGCTGTCGGTGGG 22M chr2:38853845 15
## 12 AACATTCATTGCTGTCGGTGGG 22M chr2:38853845 15
## 13 AACATTCATTGCTGTCGGTGGG 22M chr2:38853845 867
## 14 AACATTCATTGCTGTCGGTGGG 22M chr2:38853845 867
## 15 AACATTCATTGCTGTCGGTGGG 22M mmu-miR-181b-5pA:1 0
## 16 AACATTCATTGCTGTCGGTGGG 22M mmu-miR-181b-5pC:1 0
## 17 AACATTCATTGCTGTCGGTGGG 22M mmu-miR-181b-5pG:1 0
## 18 AACATTCATTGCTGTCGGTGGG 22M mmu-miR-181b-5pT:1 0
## feature_length src_id src_name src_type overlap
## 1 79 ENSMUSG00000065458.1 Mir181b-1 miRNA 22
## 2 79 ENSMUSG00000065458.1 Mir181b-1 precursor 22
## 3 3387 ENSMUSG00000098206.1 A430106G13Rik lincRNA 22
## 4 3387 ENSMUSG00000098206.1 A430106G13Rik precursor 22
## 5 5014 ENSMUSG00000098145.1 Gm26936 precursor 22
## 6 40851 ENSMUSG00000098145.1 Gm26936 precursor 22
## 7 101786 ENSMUSG00000098243.3 Gm4258 precursor 22
## 8 185732 ENSMUSG00000063972.13 Nr6a1 precursor 22
## 9 202843 ENSMUSG00000063972.13 Nr6a1 precursor 22
## 10 204318 ENSMUSG00000063972.13 Nr6a1 precursor 22
## 11 88 ENSMUSG00000065578.1 Mir181b-2 miRNA 22
## 12 88 ENSMUSG00000065578.1 Mir181b-2 precursor 22
## 13 1959 ENSMUSG00000107884.1 Gm44144 precursor 22
## 14 1959 ENSMUSG00000107884.1 Gm44144 TEC 22
## 15 22 mmu-miR-181b-5pA mmu-miR-181b-5p miRNA 22
## 16 22 mmu-miR-181b-5pC mmu-miR-181b-5p miRNA 22
## 17 22 mmu-miR-181b-5pG mmu-miR-181b-5p miRNA 22
## 18 22 mmu-miR-181b-5pT mmu-miR-181b-5p miRNA 22
The locations reported here are ‘raw’, and include multiple copies (for instance isomirs with 5’ nucleotide addition, or the same feature with different names) as well as introns. Introns (with the exception of those from piRNA precursors) will be discarded if there is a processed feature that has an equivalent overlap with the read. Features that go by different names will be collapsed. This leaves us with two ‘real’ locations.
The alignment on chr1 overlaps entirely with two different features: the A430106G13Rik lincRNA, and the Mir181b-1 gene. Since in both cases the overlap is 22 nucleotides, this information cannot be used to discriminate the potential sources. Given the tie, the outcome will depend on another customizable rule, the highPriorityTypes: since miRNA are among the highPriorityTypes, while lincRNA are not, the miRNA will be selected as the most likely source (I am currently working on a expectation-maximization method to break these ties on the basis of the coverage across the rest of each element).
Once this is performed at each location, the assignments from the different locations are collapsed using the same rules. In this case, since they all encode for the same miRNA, the sequence is assigned to that miRNA, again using the most specific non-ambiguous one (mmu-miR-181b-5p). Had these been different miRNAs, the src_type would still be said to be miRNA, but the status would be ambiguous, and the src_name would indicate both potential sources; had there been an ambiguity across different biotypes, the src_type would also be set to ambiguous.
o@sources["AACATTCATTGCTGTCGGTGGG",]
## location cigar src_name src_type overlap
## AACATTCATTGCTGTCGGTGGG ambiguous 22M mmu-mir-181b-5p miRNA 22
## length status
## AACATTCATTGCTGTCGGTGGG 22 unique
Importantly, ambiguous features are not simply discarded: if one of the read’s potential source RNA is differentially-expressed, it will most likely show up in the sum of their expression, and hence be picked up by the differential-expression analysis below. However, by flagging it as ambiguous (rather than, say, splitting reads across the features), we keep track of the fact that their actual source is uncertain.
(Note: by default, the assignment is multithreaded, using ncores()-1 threads.)
The default assignment rules are:
defaultAssignRules()
## $miRNA
## $miRNA$size
## [1] 20 21 22 23
##
## $miRNA$minOverlap
## [1] 12
##
##
## $primary_piRNA
## function (x)
## {
## checkPiRNA(x, type = "primary")
## }
## <bytecode: 0x56077a24aad8>
## <environment: 0x56076a6394b8>
##
## $secondary_piRNA
## function (x)
## {
## checkPiRNA(x, type = "secondary")
## }
## <bytecode: 0x560774160f78>
## <environment: 0x56076a6394b8>
##
## $tRF
## $tRF$gtRNAdbOnly
## [1] TRUE
##
## $tRF$types
## $tRF$types$tRNA_5p_half
## function (x)
## {
## x$pos_in_feature <= 1 & x$length %in% 30:34
## }
## <bytecode: 0x56076ee3c6c8>
## <environment: 0x56076a6394b8>
##
## $tRF$types$tRNA_3p_half
## function (x)
## {
## x$pos_in_feature >= (x$feature_length - x$length - 1) & x$length >=
## 38 & x$length <= 50
## }
## <bytecode: 0x5607884c6568>
## <environment: 0x56076a6394b8>
##
## $tRF$types$tRNA_5p_fragment
## function (x)
## {
## x$pos_in_feature < 5 & x$length < 30
## }
## <bytecode: 0x56076fa8c788>
## <environment: 0x56076a6394b8>
##
## $tRF$types$tRNA_3p_fragment
## function (x)
## {
## (x$pos_in_feature + x$length) > (x$feature_length - 5)
## }
## <bytecode: 0x56077bae1ba0>
## <environment: 0x56076a6394b8>
##
##
##
## $highPriorityTypes
## [1] "miRNA" "tRNA" "tRNAp" "Mt_tRNA"
## [5] "snRNA" "snoRNA" "antisense" "piRNA_precursor"
##
## $lowPriorityTypes
## [1] "precursor"
For instance, the check for primary piRNA (which looks at length and at nucleotide signature) is performed if and only if the read is assigned to a piRNA precursor; if the check fails (for instance because the sequence and its reverse complement both lack the signature), the read is considered originating from a “piRNA_precursor”, whereas if it succeeds it is considered as “primary_piRNA”.
After read assignment, different read counts can be accessed:
o@seqcounts stores the read counts for each individual sequenceo@agcounts stores the aggregated read counts for each unambiguous features. Importantly, this can include different ‘levels’ of features: for instance, if some reads can be assigned to mmu-miR-320-3p and mmu-miR-320-5p, and others can only be assigned to mmu-miR-320 (for instance because they are not the canonical miRNAs), the mmu-miR-320 entry in that table will be the aggregated of all three entries, and will figure alongside the first two.o@agcounts_ambiguous stores the aggregated read counts for ambiguous features, i.e. combinations of features from which reads are equally probable to originate. For instance, if some reads map equally well to both geneA and geneB, they will be aggregated in the entry “geneA;geneB” of that table.Importantly, all these read counts are not normalized. To get normalized counts for specific (sets of) features, see the getSeqCounts and getFeatureCounts functions (below). To get a complete normalized matrix, use normalizeCounts on your matrix of interest, e.g. normalizeCounts(o@agcounts, o@norm).
A set of functions allow the quick exploration of a smallRNAexp dataset, which are showcased here. Each function have a number of parameters allowing to, for instance, plot slightly different or more selective information (for instance narrowing down to a specific RNA biotype).
The two following functions summarize the alignment statistics:
plotAlignStats(o)
plotClipping(o)
The following function summarizes the read distribution across RNA biotypes:
plotComposition(o)
By default, unknown and reads whose biotype is ambiguous are excluded, and biotypes with a very small proportion of the reads are aggregated into ‘other’. To see all composition, you can tinker with the parameters of the plotComposition function, or simply query the whole composition table, in either case setting the exclude parameter to NA or c():
getComposition(o, exclude=c())
## CTRL1_GSM1214556 CTRL2_GSM1214557
## precursor 2.407598e-01 1.065914e-01
## Mt_tRNA 6.198491e-02 6.790921e-02
## Mt_rRNA 6.546527e-02 1.553151e-02
## ambiguous 1.111241e-01 8.687495e-02
## unknown 3.357353e-02 1.190647e-02
## lincRNA 2.596114e-03 1.151660e-03
## protein_coding 1.760084e-02 5.781801e-03
## miRNA 4.096185e-02 2.482151e-01
## piRNA_precursor 7.276036e-03 1.191753e-03
## tRNA_5p_fragment 1.053900e-01 7.506111e-02
## snRNA 4.596710e-04 1.562275e-04
## tRNA_3p_fragment 2.044809e-02 5.559212e-03
## tRNA_internal_fragment 7.857768e-02 8.000785e-02
## primary_piRNA 2.576063e-03 2.046166e-04
## sense_intronic 1.418614e-04 4.147634e-06
## processed_pseudogene 2.270785e-04 9.677812e-06
## antisense 1.505335e-03 3.566965e-04
## TEC 5.413792e-04 8.157013e-05
## snoRNA 1.228129e-04 4.175285e-04
## rRNA 1.273244e-04 1.520799e-05
## processed_transcript 7.960279e-04 3.497838e-04
## misc_RNA 8.200892e-04 5.087764e-04
## tRNA_5p_half 2.068179e-01 2.920860e-01
## transcribed_unprocessed_pseudogene 1.303320e-05 5.530178e-06
## unprocessed_pseudogene 1.453703e-05 5.530178e-06
## transcribed_processed_pseudogene 1.253193e-05 4.147634e-06
## bidirectional_promoter_lncRNA 1.754470e-05 2.765089e-06
## sense_overlapping 4.862387e-05 9.677812e-06
## CTRL3_GSM1214551 CTRL4_GSM1214552
## precursor 1.790918e-01 1.678824e-01
## Mt_tRNA 1.352729e-01 1.720984e-01
## Mt_rRNA 1.141868e-01 1.312991e-01
## ambiguous 9.452920e-02 6.079159e-02
## unknown 2.652069e-02 1.379524e-02
## lincRNA 1.004030e-03 8.646358e-04
## protein_coding 2.376898e-02 1.369957e-02
## miRNA 6.084019e-02 8.798280e-02
## piRNA_precursor 2.904314e-03 1.803211e-03
## tRNA_5p_fragment 6.492489e-02 4.870570e-02
## snRNA 4.671102e-04 3.455525e-04
## tRNA_3p_fragment 1.091279e-02 9.350742e-03
## tRNA_internal_fragment 5.756675e-02 5.767196e-02
## primary_piRNA 1.665871e-03 6.992534e-04
## sense_intronic 9.797803e-06 7.544815e-06
## processed_pseudogene 2.295135e-04 4.194917e-05
## antisense 4.217954e-04 3.340844e-04
## TEC 2.929543e-04 1.985795e-04
## snoRNA 1.109601e-04 1.807738e-04
## rRNA 8.891506e-05 4.919219e-05
## processed_transcript 2.469046e-04 2.215158e-04
## misc_RNA 9.530813e-04 1.003460e-03
## tRNA_5p_half 2.239528e-01 2.309432e-01
## transcribed_unprocessed_pseudogene 7.838242e-06 4.526889e-06
## unprocessed_pseudogene 7.838242e-06 6.035852e-06
## transcribed_processed_pseudogene 2.939341e-06 1.508963e-06
## bidirectional_promoter_lncRNA 8.083187e-06 8.450192e-06
## sense_overlapping 1.028769e-05 9.053778e-06
## MSUS1_GSM1214555 MSUS2_GSM1214553
## precursor 1.685497e-01 1.316291e-01
## Mt_tRNA 1.347170e-01 7.988652e-02
## Mt_rRNA 1.240192e-01 8.930328e-02
## ambiguous 6.887332e-02 8.858756e-02
## unknown 1.426212e-02 1.527658e-02
## lincRNA 1.059373e-03 7.070094e-04
## protein_coding 1.724985e-02 1.542859e-02
## miRNA 8.968248e-02 1.265782e-01
## piRNA_precursor 1.834614e-03 2.102027e-03
## tRNA_5p_fragment 4.788248e-02 5.954291e-02
## snRNA 4.346995e-04 6.634646e-04
## tRNA_3p_fragment 1.399376e-02 1.229785e-02
## tRNA_internal_fragment 7.589529e-02 8.543228e-02
## primary_piRNA 1.624316e-03 1.426950e-03
## sense_intronic 8.295791e-06 5.515676e-05
## processed_pseudogene 6.595154e-05 1.406629e-04
## antisense 5.483518e-04 4.058904e-04
## TEC 1.845814e-04 1.578170e-04
## snoRNA 3.845099e-04 1.340652e-04
## rRNA 1.389545e-04 7.257469e-05
## processed_transcript 3.177288e-04 2.686583e-04
## misc_RNA 8.768652e-04 1.792199e-03
## tRNA_5p_half 2.373492e-01 2.880798e-01
## transcribed_unprocessed_pseudogene 3.318317e-06 5.278159e-07
## unprocessed_pseudogene 4.977475e-06 1.055632e-05
## transcribed_processed_pseudogene 2.488737e-06 1.319540e-06
## bidirectional_promoter_lncRNA 8.710581e-06 9.500686e-06
## sense_overlapping 2.779090e-05 8.972870e-06
## MSUS3_GSM1214554
## precursor 1.698314e-01
## Mt_tRNA 6.286539e-02
## Mt_rRNA 9.650931e-02
## ambiguous 7.410346e-02
## unknown 2.135321e-02
## lincRNA 1.324912e-03
## protein_coding 1.323194e-02
## miRNA 7.597656e-02
## piRNA_precursor 2.235952e-03
## tRNA_5p_fragment 5.211981e-02
## snRNA 5.658859e-04
## tRNA_3p_fragment 1.425907e-02
## tRNA_internal_fragment 9.404690e-02
## primary_piRNA 7.434085e-04
## sense_intronic 5.205942e-05
## processed_pseudogene 1.384781e-04
## antisense 5.976421e-04
## TEC 2.576941e-04
## snoRNA 1.457664e-04
## rRNA 8.173329e-05
## processed_transcript 4.482316e-04
## misc_RNA 1.838739e-03
## tRNA_5p_half 3.172288e-01
## transcribed_unprocessed_pseudogene 1.509723e-05
## unprocessed_pseudogene 7.808913e-06
## transcribed_processed_pseudogene 4.685348e-06
## bidirectional_promoter_lncRNA 1.093248e-05
## sense_overlapping 5.205942e-06
In case there is substantial compositional variation across samples, as is the case here, the first thing to do is to understand whether this variation is related to the comparison of interest. In the composition PCA plot above, the first component was tested for statistical association with the sample’s group, and the same was done with the polar coordinates from the two combined components (mapped to colors on the plot). Since there is no significant association, the variation in composition is unlikely to be related to groups, and instead represents a ‘random’ variation that could potentially mask effects of interest. It is therefore desirable to include this variation in the statistical model.
To do so, one first needs to compress the compositional variability into one (or few) variable. How to do so will depend on the exact distributions: when the first principal component (PC) explains most of the variation, it can be directly used as a variable. When this is not the case, one can use the first 2 components, or the polar coordinates of the first 2, or a multidimensional scaling forced onto a single variable. All of this can be done manually, or with the help of the getCompositionComponent function of the package:
a <- getCompositionComponent(o, summarytype="mds")
a
## compositionV1
## CTRL1_GSM1214556 0.44146394
## CTRL2_GSM1214557 -1.00000000
## CTRL3_GSM1214551 0.36954270
## CTRL4_GSM1214552 0.28867146
## MSUS1_GSM1214555 0.20580308
## MSUS2_GSM1214553 -0.27185462
## MSUS3_GSM1214554 -0.03362656
Once we have this variable, we can add it to the phenoData, which will allow us to use it in the differential expression analysis:
o@phenoData <- cbind(o@phenoData, a)
The exclude parameter allows us to decide what kinds of RNA get into the composition plot. To plot tRNA fragment sub-composition, we can exclude everything but the tRNA-related RNAs, and use abridged=F to get more specific RNA types:
plotComposition(o, abridged=F, exclude=setdiff(row.names(o@composition$raw$all),tRNAtype()))
Abundances of different sequence lengths can be checked using the following function (see also the plotSizeAbundance function):
checkSizeSelection(o)
## Warning in ln[, 2] * max(y): NAs produced by integer overflow
The first row is before and after normalization. Each line in the plots represents a sample, except for the grey line in the background which indicates the relative proportion of unique sequences with the given length.
Small RNA seq libraries frequently have a GC-bias, which is not much of a problem for differential expression provided that it’s stable across samples. We can check this with:
plotGC(o)
Samples’ variation according to expression levels can be observed using MA-like plots (using, for each sample, the foldchanges to the mean of all samples):
plotMAs(o)
This can provide information as to the adequacy of the normalization method.
The smallRNA package relies on other packages (currently either edgeR, DESeq2 or limma/voom) to perform the differential expression analysis (DEA) on different types of features (e.g. specific sequences, aggregated counts, etc). It is recommended because it handles filtering, normalization (including specific normalization procedures described below), and (if the voom method is used) can use mixed models to handle incomplete replicates. The function can be called either using a formula (including elements of the phenoData slot) or specifying groups (see ?runDEA for details). Since we want to include compositional information, we will run it with a formula that includes it:
res <- runDEA(o, formula="~compositionV1+group")
The different analysis results are stored as different elements of a list:
lapply(res, names)
## $seqLevel
## [1] "miRNA" "tRNA" "primary_piRNA" "allKnownUnique"
## [5] "ambiguous" "unknown"
##
## $aggregated
## [1] "miRNA" "tRNA" "allUnique" "ambiguous"
##
## $info
## [1] "types" "norm" "DEA.call" "assignmentRules"
## [5] "phenoData"
Each element of the list (except for info) is a table that reports the log2FC, p-value and FDR/q-value of each feature:
head(res$seqLevel$miRNA)
## logFC logCPM LR PValue
## AGGTCTGTAGCTCAGTTG 6.374751 6.626553 8.293126 0.003979544
## AGGTCTGTAGCTCAGTTGGTTAGAGCG 2.970828 6.718845 7.826308 0.005149118
## AGGTCTGTAGCTCAGTTGG 6.296147 6.478628 7.649414 0.005679077
## ACTCTTAGCGGTGGATCACTCGG 3.011217 4.265150 7.529594 0.006069359
## TTTGTTCGTTCGGCTCCCGTGA 2.183189 4.686567 7.439492 0.006380788
## TGAGGTAGTAGGTTGTATGGTTTTT 2.401769 2.419412 7.142183 0.007529144
## FDR
## AGGTCTGTAGCTCAGTTG 0.7817182
## AGGTCTGTAGCTCAGTTGGTTAGAGCG 0.7817182
## AGGTCTGTAGCTCAGTTGG 0.7817182
## ACTCTTAGCGGTGGATCACTCGG 0.7817182
## TTTGTTCGTTCGGCTCCCGTGA 0.7817182
## TGAGGTAGTAGGTTGTATGGTTTTT 0.7817182
All of these results can be written to a multi-sheet excel file using the following function:
writeDEA(res)
We can get an overview of the results using the volcano plots:
volcano(res$seqLevel$tRNA)
Although in this case there’s very little to be viewed…
Normalized counts of sequences associated to a specific feature can be plotted using the following function:
plotFeature(o, "mmu-mir-425-5p", las=3)
## use default substitution matrix
(To plot un-normalized counts, simply use normalized=FALSE)
It is also possible to plot the aggregated count, and to fetch features matching a regex pattern. For instance, we can plot the read counts of all mmu-mir-196 related entries using:
plotFeature(o, "mir-196", exact=F, aggregated=T, las=3)
This could also be plotted as a heatmap (plotType=“heatmap”) or a heatmap of z-scores:
plotFeature(o, "mir-196", exact=F, plotType="zheatmap", las=3)
## use default substitution matrix
Finally, we could also plot the proportion of each sequences mapping to a feature in the following way:
plotFeature(o, "mmu-mir-425-5p", proportion=TRUE, las=3)
## use default substitution matrix
By default, when plotting multiple sequences of the same feature, multiple alignment is used to align the sequences in the legend, as shown above. This might not work well if the sequences are very different from one another, in which case this feature can be turned off with the alignSeqs=FALSE parameter.
See ?plotFeature for more details on other parameters.
Note: Both in these plots and in the DEA results, sequences that were soft-clipped during alignment will be shown in small letters.
By default, normalization factors are calculated on all mapped sequences (including sequences of ambiguous origins and sequences mapping to unnanotated regions of the genome), but this is not always the best option.
Samples can have a different RNA composition for a number of reasons: because of biological differences, or because of technical variation such as differences in the extraction efficiency. If the RNA composition differs due to, say, a technique’s unbalanced representation of nuclear and cytoplasmic components of the cells, normalizing on the entire RNA population could artificially create differences. In such a case, it might be advisable to concentrate on the fraction of interest and calculate normalization factor only on the basis of selected types of RNAs. This can be done with the normalizeOnType argument of the calcNormFactors.smallRNAexp function. For instance:
o <- calcNormFactors(o, normalizeOnType=c("miRNA","tRNA"))
would calculate normalization factors only based on miRNA and tRNA sequences, making the assumption that most of these are not differentially-expressed across samples.
The impact of different kinds of normalization can be assessed using the plotMAs and checkSizeSelection functions, both of which accept a type argument. For instance, we can see that in the current dataset, normalizing on either “miRNA” or “tRNA” leads to and over- or under-representation of other types of RNAs in one of the samples:
o <- calcNormFactors(o, normalizeOnType="miRNA")
checkSizeSelection(o, type="tRNA")
plotMAs(o, type="tRNA")
o <- calcNormFactors(o, normalizeOnType="tRNA")
checkSizeSelection(o, type="miRNA")
It is difficult to determine whether these differences are technical or biological, but in general it is commonly assumed that the best normalization procedure is the one that minimizes differences across samples.
calcNormFactors also implements the conditional quartile normalization (cqn) taking into account potential GC bias (requires the cqn package). The offsets generated are automatically used for differential expression analysis.
Like standard quantile normalization, CQN normalization is somewhat more ‘aggressive’, making samples more alike, as illustrated below:
o <- calcNormFactors(o)
checkSizeSelection(o)
## Warning in ln[, 2] * max(y): NAs produced by integer overflow
o <- calcNormFactors(o, method="cqn")
## fitting ...
##
|
| | 0%
|
|====================== | 33%
|
|=========================================== | 67%
|
|=================================================================| 100%
## Using 'sigma' instead 'sig2' (= sigma^2) is preferred now
checkSizeSelection(o)
## Warning in ln[, 2] * max(y): NAs produced by integer overflow
For this reason it should be used with care. One circumstance where this is particularly useful is when there is a different GC-bias across samples, which can be monitored through the plotGC function of the package.